Oracle Database 12.2: What an unimaginable horror!


Oracle Database 12.2.

It is close to 25 million lines of C code.

What an unimaginable horror! You can’t change a single line of code in the product without breaking 1000s of existing tests. Generations of programmers have worked on that code under difficult deadlines and filled the code with all kinds of crap.

Very complex pieces of logic, memory management, context switching, etc. are all held together with thousands of flags. The whole code is riddled with mysterious macros that one cannot decipher without picking up a notebook and expanding the relevant parts of the macros by hand. It can take a day or two to really understand what a macro does.
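To make the macro problem concrete, here is a minimal sketch of three macros layered on top of each other. Every name in it (`FLAG_IDX_*`, `FLAG_BIT`, `HAS`, `WHEN`, `ctx_t`, `rows_to_prefetch`) is hypothetical and made up for illustration; none of it comes from Oracle's source. Even at this toy scale, the call site tells you very little until you expand it by hand.

```c
#include <stdint.h>

/* Hypothetical names throughout; nothing here is taken from real Oracle code. */
enum { FLAG_IDX_LEGACY = 0, FLAG_IDX_NOCACHE = 1 };

typedef struct { uint32_t flags; } ctx_t;

/* Layer 1: token pasting hides which enum constant is actually used. */
#define FLAG_BIT(name)  (1u << FLAG_IDX_##name)
/* Layer 2: a flag test built on layer 1. */
#define HAS(c, name)    (((c)->flags & FLAG_BIT(name)) != 0)
/* Layer 3: yields expr only when flag `on` is set and flag `off` is clear. */
#define WHEN(c, on, off, expr)  ((HAS(c, on) && !HAS(c, off)) ? (expr) : 0)

int rows_to_prefetch(const ctx_t *c) {
    /* Expanded by hand, the line below reads:
     *   ((((c)->flags & (1u << FLAG_IDX_LEGACY)) != 0) &&
     *    !(((c)->flags & (1u << FLAG_IDX_NOCACHE)) != 0)) ? (64) : 0
     */
    return WHEN(c, LEGACY, NOCACHE, 64);
}
```

Three layers and two flags already need a comment to be readable. The macros described above are presumably far deeper than this sketch.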

Sometimes one needs to understand the values and the effects of 20 different flags to predict how the code would behave in different situations. Sometimes 100s too! I am not exaggerating.
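A rough sketch of why 20 flags are so hard to reason about: if each flag is an independent on/off bit, every extra flag doubles the number of combinations a reader has to keep in mind. The flag names and the `choose_path` function below are assumptions invented for this example, not anything from the real code base.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags for illustration only. */
enum {
    OPT_LEGACY_MODE = 1u << 0,
    OPT_SKIP_CACHE  = 1u << 1,
    OPT_BUG_FIX_A   = 1u << 2
};

enum path { PATH_DEFAULT, PATH_LEGACY_CACHED, PATH_WORKAROUND };

/* Number of distinct on/off combinations for flag_count independent flags. */
uint64_t flag_combinations(unsigned flag_count) {
    assert(flag_count < 64);  /* the shift below is only defined for < 64 */
    return UINT64_C(1) << flag_count;
}

/* The result depends on the combination of flags, not on any single flag:
 * OPT_BUG_FIX_A only matters when the first branch does not fire. */
enum path choose_path(unsigned flags) {
    if ((flags & OPT_LEGACY_MODE) && !(flags & OPT_SKIP_CACHE))
        return PATH_LEGACY_CACHED;
    if (flags & OPT_BUG_FIX_A)
        return PATH_WORKAROUND;
    return PATH_DEFAULT;
}
```

Here `flag_combinations(20)` is 1,048,576. In this sketch, setting `OPT_BUG_FIX_A` changes nothing when `OPT_LEGACY_MODE` is set and `OPT_SKIP_CACHE` is clear, which is the kind of interaction that makes a single flag impossible to understand in isolation.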

The only reason this product is still surviving and still works is literally millions of tests!

Here is how the life of an Oracle Database developer is:

  • Start working on a new bug.

  • Spend two weeks trying to understand the 20 different flags that interact in mysterious ways to cause this bug.

  • Add one more flag to handle the new special scenario. Add a few more lines of code that check this flag, work around the problematic situation, and avoid the bug.

  • Submit the changes to a test farm consisting of about 100 to 200 servers that would compile the code, build a new Oracle DB, and run the millions of tests in a distributed fashion.

  • Go home. Come in the next day and work on something else. The tests can take 20 to 30 hours to complete.

  • Go home. Come in the next day and check your farm test results. On a good day, there would be about 100 failing tests. On a bad day, there would be about 1000 failing tests. Pick some of these tests randomly and try to understand what went wrong with your assumptions. Maybe there are some 10 more flags to consider to truly understand the nature of the bug.

  • Add a few more flags in an attempt to fix the issue. Submit the changes again for testing. Wait another 20 to 30 hours.

  • Rinse and repeat for another two weeks until you get the mysterious incantation of the combination of flags right.

  • Finally one fine day you would succeed with 0 tests failing.

  • Add a hundred more tests for your new change to ensure that the next developer who has the misfortune of touching this new piece of code never ends up breaking your fix.

  • Submit the work for one final round of testing. Then submit it for review. The review itself may take another 2 weeks to 2 months. So now move on to the next bug to work on.

  • After 2 weeks to 2 months, when everything is complete, the code is finally merged into the main branch.

The above is a non-exaggerated description of the life of a programmer at Oracle fixing a bug. Now imagine what horror it is going to be to develop a new feature. It takes 6 months to a year (sometimes two years!) to develop a single small feature (say, adding a new mode of authentication such as support for AD authentication).

The fact that this product even works is nothing short of a miracle!

I don’t work for Oracle anymore. Will never work for Oracle again!
